Free, publicly accessible full text available April 6, 2026
-
Continuous speaker separation aims to separate overlapping speakers in real-world environments like meetings, but it often falls short in isolating speech segments of a single speaker. This leads to split signals that adversely affect downstream applications such as automatic speech recognition and speaker diarization. Existing solutions like speaker counting have limitations. This paper presents a novel multi-channel approach for continuous speaker separation based on multi-input multi-output (MIMO) complex spectral mapping. This MIMO approach enables robust speaker localization by preserving inter-channel phase relations. Speaker localization as a byproduct of the MIMO separation model is then used to identify single-talker frames and reduce speaker splitting. We demonstrate that this approach achieves superior frame-level sound localization. Systematic experiments on the LibriCSS dataset further show that the proposed approach outperforms other methods, advancing state-of-the-art speaker separation performance.
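To illustrate why preserved inter-channel phase relations matter for localization, here is a minimal sketch of a standard two-microphone delay estimator (GCC-PHAT). This is a swapped-in textbook technique, not the localization method of the paper; the function name `gcc_phat_delay` and its parameters are hypothetical.

```python
import numpy as np

def gcc_phat_delay(x1, x2, max_delay):
    """Estimate the delay (in samples) of x1 relative to x2 with GCC-PHAT.

    A positive result means x1 lags x2. Illustrative only: a standard
    estimator, not the paper's localization model.
    """
    n = len(x1) + len(x2)              # zero-pad to avoid circular wrap-around
    X1 = np.fft.rfft(x1, n=n)
    X2 = np.fft.rfft(x2, n=n)
    cross = X1 * np.conj(X2)           # cross-spectrum carries the phase difference
    cross /= np.abs(cross) + 1e-12     # PHAT weighting: keep phase, drop magnitude
    cc = np.fft.irfft(cross, n=n)
    # Reorder so the array covers lags -max_delay .. +max_delay.
    cc = np.concatenate((cc[-max_delay:], cc[:max_delay + 1]))
    return int(np.argmax(cc)) - max_delay
```

Because the estimator relies only on the phase of the cross-spectrum, it would only work on separated signals if each microphone's estimate keeps its original phase, which is the property the abstract attributes to MIMO complex spectral mapping.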
-
The performance of automatic speech recognition (ASR) systems degrades substantially on overlapped speech, because they are designed for single-talker speech. To enhance ASR performance in conversational or meeting environments, continuous speaker separation (CSS) is commonly employed. However, CSS requires a short separation window, to avoid having many speakers inside the window, as well as sequential grouping of discontinuous speech segments. To address these limitations, we introduce a new multi-channel framework called “speaker separation via neural diarization” (SSND) for meeting environments. Our approach utilizes an end-to-end diarization system to identify the speech activity of each individual speaker. By leveraging estimated speaker boundaries, we generate a sequence of embeddings, which in turn facilitate the assignment of speakers to the outputs of a multi-talker separation model. SSND addresses the permutation ambiguity issue of talker-independent speaker separation during the diarization phase through location-based training, rather than during the separation process. This unique approach allows multiple non-overlapped speakers to be assigned to the same output stream, making it possible to efficiently process long segments, a task impossible with CSS. Additionally, SSND is naturally suitable for speaker-attributed ASR. We evaluate our proposed diarization and separation methods on the open LibriCSS dataset, advancing state-of-the-art diarization and ASR results by a large margin.
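The idea that several non-overlapped speakers can share one output stream can be sketched as a simple assignment over diarization intervals. This is an illustration under assumptions, not the SSND procedure: the abstract describes embedding-based assignment, whereas the hypothetical `assign_streams` below uses a plain greedy rule on (start, end) times, and it assumes every speaker has at least one interval.

```python
from typing import Dict, List, Tuple

Interval = Tuple[float, float]  # (start, end) of one speech segment, in seconds

def overlaps(a: List[Interval], b: List[Interval]) -> bool:
    """True if any segment in a overlaps in time with any segment in b."""
    return any(s1 < e2 and s2 < e1 for s1, e1 in a for s2, e2 in b)

def assign_streams(activity: Dict[str, List[Interval]],
                   num_streams: int = 2) -> Dict[str, int]:
    """Greedily map speakers to streams so overlapping speakers never share one.

    Hypothetical helper for illustration; the greedy rule can fail on inputs
    where a more careful search would still find a valid assignment.
    """
    assignment: Dict[str, int] = {}
    # Visit speakers in order of their first speech onset.
    by_onset = sorted(activity, key=lambda spk: min(s for s, _ in activity[spk]))
    for spk in by_onset:
        for stream in range(num_streams):
            clash = any(assignment[other] == stream
                        and overlaps(activity[spk], activity[other])
                        for other in assignment)
            if not clash:
                assignment[spk] = stream
                break
        else:
            raise ValueError(f"no free stream for speaker {spk!r}")
    return assignment
```

For example, with speaker A active during 0 to 5 s, B during 4 to 8 s, and C during 9 to 12 s, A and C end up on stream 0 and B on stream 1, so two streams carry three speakers.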
-
Current deep-learning-based multi-channel speaker separation methods produce a monaural estimate of speaker signals captured by a reference microphone. This work presents a new multi-channel complex spectral mapping approach that simultaneously estimates the real and imaginary spectrograms of all speakers at all microphones. The proposed multi-input multi-output (MIMO) separation model uses a location-based training (LBT) criterion to resolve the permutation ambiguity in talker-independent speaker separation across microphones. Experimental results show that the proposed MIMO separation model outperforms a multi-input single-output (MISO) speaker separation model with monaural estimates. We also combine the MIMO separation model with a beamformer and a MISO speech enhancement model to further improve separation performance. The proposed approach achieves state-of-the-art speaker separation performance on the open LibriCSS dataset.
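The way a location-based criterion can remove permutation ambiguity may be sketched as follows. The abstract does not state the ordering rule or the loss, so both are assumptions here: targets are sorted by speaker azimuth, and the loss is an L1 distance on real and imaginary parts. The name `lbt_loss` and the array layout (speakers, microphones, frequencies, frames) are hypothetical.

```python
import numpy as np

def lbt_loss(estimates, targets, azimuths_deg):
    """Loss with a fixed, location-determined output order (illustrative).

    estimates, targets: complex arrays of shape (S, M, F, T), i.e. every
        speaker's spectrogram at every microphone (the MIMO setting).
    azimuths_deg: shape (S,), one assumed azimuth per target speaker.
    """
    # Assumption: output k is trained on the speaker with the k-th smallest
    # azimuth, so no search over output permutations is needed.
    order = np.argsort(azimuths_deg)
    ordered_targets = targets[order]   # same order applied at all microphones
    diff = estimates - ordered_targets
    # Assumed loss form: mean absolute error on real and imaginary parts.
    return float(np.mean(np.abs(diff.real)) + np.mean(np.abs(diff.imag)))
```

Since one ordering is shared by all microphones, each output slot refers to the same speaker across channels, in contrast to permutation-invariant training, which would evaluate every output-to-speaker pairing.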